Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data

Authors

  • Levent Ertöz
  • Michael Steinbach
  • Vipin Kumar
Abstract

The problem of finding clusters in data is challenging when clusters are of widely differing sizes, densities and shapes, and when the data contains large amounts of noise and outliers. Many of these issues become even more significant when the data is of very high dimensionality, such as text or time series data. In this paper we present a novel clustering technique that addresses these issues. Our algorithm first finds the nearest neighbors of each data point and then redefines the similarity between pairs of points in terms of how many nearest neighbors the two points share. Using this new definition of similarity, we eliminate noise and outliers, identify core points, and then build clusters around the core points. The use of a shared nearest neighbor definition of similarity removes problems with varying density, while the use of core points handles problems with shape and size. We experimentally show that our algorithm performs better than traditional methods (e.g., K-means) on a variety of data sets: KDD Cup '99 network intrusion data, NASA Earth science time series data, and two-dimensional point sets. While our algorithm can find the “dense” clusters that other clustering algorithms find, it also finds clusters that these approaches overlook, i.e., clusters of low or medium density which are of interest because they represent relatively uniform regions “surrounded” by non-uniform or higher density areas. The run-time complexity of our technique is O(n^2) since the similarity matrix has to be constructed. However, we discuss a number of optimizations that allow the algorithm to handle large datasets efficiently. For example, 100,000 documents from the TREC collection can be clustered within an hour on a desktop computer.
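
The steps named in the abstract (nearest neighbors, shared-neighbor similarity, noise removal, core points, clusters grown around core points) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name snn_cluster and the parameters k, eps and min_pts are hypothetical, and the rule that a link only counts when two points appear in each other's neighbor lists is an assumption that the abstract does not state. The sketch computes all pairwise distances directly, which is the quadratic cost the abstract mentions, and uses none of the optimizations the paper refers to.

```python
from math import dist


def snn_cluster(points, k=8, eps=3, min_pts=3):
    """Sketch of shared-nearest-neighbor clustering; label -1 marks noise.

    k, eps and min_pts are illustrative parameters, not values from the paper.
    """
    n = len(points)

    # k-nearest-neighbor set of every point, from all pairwise distances.
    # This is the O(n^2) step.
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        knn.append(set(others[:k]))

    # SNN similarity = number of neighbors two points share.
    # Assumption: a link exists only if i and j list each other as neighbors.
    shared = {}
    strong = [set() for _ in range(n)]
    for i in range(n):
        for j in knn[i]:
            if i < j and i in knn[j]:
                s = len(knn[i] & knn[j])
                shared[(i, j)] = shared[(j, i)] = s
                if s >= eps:
                    strong[i].add(j)
                    strong[j].add(i)

    # Core points have at least min_pts strong links.
    core = [len(strong[i]) >= min_pts for i in range(n)]

    # Grow clusters by following strong links between core points.
    labels = [-1] * n
    next_label = 0
    for seed in range(n):
        if not core[seed] or labels[seed] != -1:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in strong[i]:
                if core[j] and labels[j] == -1:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1

    # Attach each non-core point to its most similar linked core point;
    # points with no such link keep the noise label -1.
    for i in range(n):
        if not core[i]:
            linked = [j for j in strong[i] if core[j]]
            if linked:
                best = max(linked, key=lambda j: shared[(i, j)])
                labels[i] = labels[best]
    return labels


# Toy data: two 5x5 grids far apart, plus one isolated point at index 50.
grid = [(x, y) for x in range(5) for y in range(5)]
data = grid + [(x + 100, y + 100) for x, y in grid] + [(50, 50)]
labels = snn_cluster(data)
# No grid point lists the isolated point among its 8 nearest neighbors,
# so it has no links and stays labeled -1 (noise).
```

Because similarity is counted in shared neighbors rather than raw distance, the same eps threshold applies to sparse and dense regions alike, which is the property the abstract credits for handling varying density.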

Similar resources

DDSC : A Density Differentiated Spatial Clustering Technique

Finding clusters with widely differing sizes, shapes and densities in the presence of noise and outliers is a challenging job. DBSCAN is a versatile clustering algorithm that can find clusters with differing sizes and shapes in databases containing noise and outliers, but it cannot find clusters based on differences in density. We extend the DBSCAN algorithm so that it can also detect clusters...

Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model

A novel clustering algorithm CSHARP is presented for the purpose of finding clusters of arbitrary shapes and arbitrary densities in high dimensional feature spaces. It can be considered as a variation of the Shared Nearest Neighbor algorithm (SNN), in which each sample data point votes for the points in its k-nearest neighborhood. Sets of points sharing a common mutual nearest neighbor are cons...

An Efficient And Scalable Density-Based Clustering Algorithm For Normalize Data

Data clustering is a method of putting similar data objects into groups. A clustering algorithm partitions a data set into groups based on the principle of maximizing intra-class similarity and minimizing inter-class similarity. Finding clusters in data, particularly high dimensional data, is difficult when the clusters have different shapes, sizes, and densities, and when data c...

FAÇADE: A Fast and Effective Approach to the Discovery of Dense Clusters in Noisy Spatial Data

FAÇADE (Fast and Automatic Clustering Approach to Data Engineering) is a spatial clustering tool that can discover clusters of different sizes, shapes, and densities in noisy spatial data. Compared with the existing clustering methods, FAÇADE has several advantages: first, it separates true data and noise more effectively. Second, most steps of FAÇADE are automatic. Third, it requires only O(nl...

DDCT: Detecting Density Differences Using A Novel Clustering Technique

Data clustering plays an important role in various fields. Data clustering approaches have been presented in recent decades. Identifying clusters with widely differing shapes, sizes and densities in the presence of noise and outliers is challenging. Many density-based clustering algorithms, such as DBSCAN, can locate arbitrary shapes, sizes and filter noise, but cannot identify clusters based o...

Publication date: 2003